Statistical learning: support vector machines

MACS 30100
University of Chicago

March 1, 2017

Hyperplanes

  • Hyperplane in two dimensions

    \[\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0\]

  • Hyperplane in \(p\) dimensions

    \[\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p = 0\]

  • Points on either side of the hyperplane

    \[\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p > 0\]

    \[\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p < 0\]
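The sign test above can be sketched numerically (a minimal NumPy illustration; the coefficient values are made up):

```python
import numpy as np

beta0 = -1.0                     # intercept
beta = np.array([1.0, 2.0])      # beta_1, beta_2

X = np.array([[2.0, 1.0],        # beta0 + X @ beta = 3  -> positive side
              [0.0, 0.0],        # beta0 + X @ beta = -1 -> negative side
              [1.0, 0.0]])       # beta0 + X @ beta = 0  -> on the hyperplane

side = np.sign(beta0 + X @ beta)
print(side)  # [ 1. -1.  0.]
```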

Classification using a separating hyperplane

\[x_1 = \begin{pmatrix} x_{11} \\ \vdots \\ x_{1p} \end{pmatrix}, \dots, x_n = \begin{pmatrix} x_{n1} \\ \vdots \\ x_{np} \end{pmatrix}\]

  • \(y_1, \dots, y_n \in \{-1, 1 \}\)
  • Test observation \(x^*\)
  • Separating hyperplane

    \[\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} > 0, \text{if } y_i = 1\] \[\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} < 0, \text{if } y_i = -1\]

Maximal margin classifier

Maximal margin hyperplane

\(n\) training observations with predictors \(x_1, \dots, x_n \in \mathbb{R}^p\) and associated class labels \(y_1, \dots, y_n \in \{-1, 1\}\)

\[\begin{aligned} & \underset{\beta_0, \beta_1, \dots, \beta_p}{\text{maximize}} & & M \\ & \text{s.t.} & & \sum_{j=1}^p \beta_j^2 = 1, \\ & & & y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}) \geq M \; \forall \; i = 1, \dots, n \\ \end{aligned}\]
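A sketch of this optimization, assuming scikit-learn (a Python analogue of the deck's R code): on separable data, a linear SVC with a very large cost approximates the hard-margin solution. The synthetic data here are illustrative, not from the course.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=3.0, size=(20, 2))   # class +1, well separated
X_neg = rng.normal(loc=-3.0, size=(20, 2))  # class -1
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 20)

# C -> infinity approximates the hard-margin (maximal margin) classifier
clf = SVC(kernel="linear", C=1e6).fit(X, y)
beta0, beta = clf.intercept_[0], clf.coef_[0]
print(2 / np.linalg.norm(beta))  # margin width is 2 / ||beta||
```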

Support vector classifier

  • Allow an imperfect separating hyperplane
  • Why do this?
    • A perfect separating hyperplane may not exist
    • Greater confidence in predictions for the majority of observations
    • Avoids overfitting the training data

Support vector classifier

  • Allows observations to exist on the wrong side of the margin
  • Allows observations to exist on the wrong side of the hyperplane

Support vector classifier

\[\begin{aligned} & \underset{\beta_0, \beta_1, \dots, \beta_p, \epsilon_1, \dots, \epsilon_n}{\text{maximize}} & & M \\ & \text{s.t.} & & \sum_{j=1}^p \beta_j^2 = 1, \\ & & & y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}) \geq M(1 - \epsilon_i), \\ & & & \epsilon_i \geq 0, \sum_{i = 1}^n \epsilon_i \leq C \\ \end{aligned}\]

  • Purpose of \(\epsilon_i\): slack variables that permit observation \(i\) to violate the margin (\(\epsilon_i > 0\)) or the hyperplane (\(\epsilon_i > 1\))
  • Role of \(C\): budget for the total amount of margin violation
    • \(C = 0\): no violations allowed; reduces to the maximal margin classifier
    • Selecting \(C\): typically via cross-validation
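The effect of the cost parameter can be sketched with scikit-learn (synthetic data, for illustration). Note that scikit-learn's `C` is a penalty on violations rather than the budget \(C\) above, so it moves in the opposite direction: a *small* scikit-learn `C` corresponds to a *large* budget and a wider margin.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
# noisy linear boundary, so the classes overlap
y = np.where(X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=100) > 0, 1, -1)

for penalty in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=penalty).fit(X, y)
    # smaller penalty -> wider margin -> more support vectors
    print(penalty, clf.n_support_.sum())
```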

Non-linear decision boundaries

Adding quadratic terms

\[X_1, X_1^2, X_2, X_2^2, \dots, X_p, X_p^2\]

\[\begin{aligned} & \underset{\beta_0, \beta_{11}, \beta_{12}, \dots, \beta_{p1}, \beta_{p2}, \epsilon_1, \dots, \epsilon_n}{\text{maximize}} & & M \\ & \text{s.t.} & & y_i \left( \beta_0 + \sum_{j = 1}^p \beta_{j1} x_{ij} + \sum_{j = 1}^p \beta_{j2} x_{ij}^2 \right) \geq M(1 - \epsilon_i), \\ & & & \epsilon_i \geq 0, \sum_{i = 1}^n \epsilon_i \leq C, \sum_{j = 1}^p \sum_{k = 1}^2 \beta_{jk}^2 = 1 \\ \end{aligned}\]
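A sketch of the enlarged-feature-space idea with scikit-learn (synthetic data): fitting a linear SVM on quadratic features yields a non-linear boundary in the original space. Note `PolynomialFeatures` also adds the cross term \(X_1 X_2\), which the formulation above omits.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # circular boundary

# enlarge the feature space with squared (and interaction) terms
X_quad = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
clf = LinearSVC(C=1.0, max_iter=10000).fit(X_quad, y)  # linear in enlarged space
print(clf.score(X_quad, y))   # near-perfect: boundary is linear in X_1^2, X_2^2
```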

Support vector classifier

  • Inner product: \(\langle a,b \rangle = \sum_{i = 1}^r a_i b_i\)

    In R, the inner product of the vector `1:5` with itself is \(1 \cdot 1 + 2 \cdot 2 + \dots + 5 \cdot 5 = 55\):

    ## [1] 1 2 3 4 5
    ## [1] 1 2 3 4 5
    ##      [,1]
    ## [1,]   55

    \[\langle x_i, x_{i'} \rangle = \sum_{j = 1}^p x_{ij} x_{i'j}\]

  • Linear support vector

    \[f(x) = \beta_0 + \sum_{i = 1}^n \alpha_i \langle x, x_i \rangle\]

    \[f(x) = \beta_0 + \sum_{i \in \mathbb{S}} \alpha_i \langle x, x_i \rangle\]
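The support-vector representation above can be checked numerically with a fitted scikit-learn SVM (synthetic data): `dual_coef_` stores \(y_i \alpha_i\) for the support vectors \(\mathbb{S}\), so summing their inner products with a new point reproduces the decision function.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
x_new = np.array([0.5, -0.5])

# beta_0 + sum over support vectors of (y_i alpha_i) <x, x_i>
f = clf.intercept_[0] + clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new)
print(np.isclose(f, clf.decision_function([x_new])[0]))  # True
```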

Kernels

  • Generalization of the inner product

    \[K(x_i, x_{i'})\]

  • Linear kernel

    \[K(x_i, x_{i'}) = \sum_{j = 1}^p x_{ij} x_{i'j}\]

Polynomial kernel

\[K(x_i, x_{i'}) = (1 + \sum_{j = 1}^p x_{ij} x_{i'j})^d\]

\[f(x) = \beta_0 + \sum_{i \in \mathbb{S}} \alpha_i K(x,x_i)\]
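The polynomial kernel is cheap to evaluate directly, with no need to construct the degree-\(d\) feature expansion. A minimal sketch:

```python
import numpy as np

def poly_kernel(x, x_prime, d=3):
    """Polynomial kernel K(x, x') = (1 + <x, x'>)^d."""
    return (1 + x @ x_prime) ** d

x = np.array([1.0, 2.0])
x_prime = np.array([0.5, -1.0])
print(poly_kernel(x, x_prime))   # (1 + 0.5 - 2)^3 = (-0.5)^3 = -0.125
```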

Radial kernel

\[K(x_i, x_{i'}) = \exp(- \gamma \sum_{j=1}^p (x_{ij} - x_{i'j})^2)\]
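A sketch of the radial kernel: its value depends only on the squared Euclidean distance between the two observations, so nearby points get values near 1 and distant points get values near 0.

```python
import numpy as np

def radial_kernel(x, x_prime, gamma=0.5):
    """Radial kernel K(x, x') = exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

x = np.array([1.0, 2.0])
print(radial_kernel(x, x))                      # identical points -> 1.0
print(radial_kernel(x, np.array([4.0, 6.0])))   # distant points -> near 0
```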

Why use kernels

  • Avoids explicitly constructing coordinates in the enlarged feature space
  • Only requires computing \(K(x_i, x_{i'})\) for the \(\binom{n}{2}\) distinct pairs \(i, i'\)
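Fitting with a kernel only requires the \(n \times n\) matrix of pairwise kernel values, computed in the original \(p\)-dimensional space. A sketch with scikit-learn's pairwise-kernel helper:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(4)
X = rng.normal(size=(5, 3))          # n = 5 observations, p = 3 features

K = rbf_kernel(X, gamma=0.5)         # 5 x 5, one entry per pair (i, i')
print(K.shape)                       # (5, 5)
print(np.allclose(np.diag(K), 1.0))  # K(x_i, x_i) = 1 for the radial kernel
```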

Applying and interpreting SVMs

  • SVMs are predictive models
  • Not well suited for inference: the fitted coefficients are not directly interpretable
  • Interpret via predicted classes, decision boundaries, and ROC/AUC

Titanic - linear SVM

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     5
## 
## - best performance: 0.364 
## 
## - Detailed performance results:
##    cost error dispersion
## 1 1e-03 0.400     0.0686
## 2 1e-02 0.390     0.0779
## 3 1e-01 0.376     0.0729
## 4 1e+00 0.366     0.0760
## 5 5e+00 0.364     0.0735
## 6 1e+01 0.364     0.0735
## 7 1e+02 0.364     0.0735
## 
## Call:
## best.tune(method = svm, train.x = Survived ~ Age + Fare, data = as_tibble(titanic_split$train), 
##     ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)), 
##     kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  5 
##       gamma:  0.5 
## 
## Number of Support Vectors:  372
## 
##  ( 186 186 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

## Area under the curve: 0.759

Titanic - polynomial SVM

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##   cost
##  0.001
## 
## - best performance: 0.394 
## 
## - Detailed performance results:
##    cost error dispersion
## 1 1e-03 0.394     0.0844
## 2 1e-02 0.398     0.0872
## 3 1e-01 0.396     0.0863
## 4 1e+00 0.398     0.0851
## 5 5e+00 0.396     0.0858
## 6 1e+01 0.398     0.0851
## 7 1e+02 0.396     0.0858
## 
## Call:
## best.tune(method = svm, train.x = Survived ~ Age + Fare, data = as_tibble(titanic_split$train), 
##     ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)), 
##     kernel = "polynomial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  0.001 
##      degree:  3 
##       gamma:  0.5 
##      coef.0:  0 
## 
## Number of Support Vectors:  397
## 
##  ( 198 199 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

## Area under the curve: 0.724

Titanic - radial SVM

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##   100
## 
## - best performance: 0.326 
## 
## - Detailed performance results:
##    cost error dispersion
## 1 1e-03 0.400     0.0625
## 2 1e-02 0.400     0.0625
## 3 1e-01 0.358     0.0830
## 4 1e+00 0.342     0.0819
## 5 5e+00 0.330     0.0634
## 6 1e+01 0.328     0.0681
## 7 1e+02 0.326     0.0700
## 
## Call:
## best.tune(method = svm, train.x = Survived ~ Age + Fare, data = as_tibble(titanic_split$train), 
##     ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)), 
##     kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  100 
##       gamma:  0.5 
## 
## Number of Support Vectors:  334
## 
##  ( 165 169 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

## Area under the curve: 0.7

Titanic - ROC curves

Voter turnout

## # A tibble: 1,165 × 8
##    vote96 mhealth_sum   age  educ  black female married inc10
##    <fctr>       <dbl> <dbl> <dbl> <fctr> <fctr>  <fctr> <dbl>
## 1       1           0    60    12      0      0       0  4.81
## 2       1           1    36    12      0      0       1  8.83
## 3       0           7    21    13      0      0       0  1.74
## 4       0           6    29    13      0      0       0 10.70
## 5       1           1    41    15      1      1       1  8.83
## 6       1           2    48    20      0      0       1  8.83
## 7       0           9    20    12      0      1       0  7.22
## 8       0          12    27    11      0      1       0  1.20
## 9       1           2    28    16      0      0       1  7.22
## 10      1           0    72    14      0      0       1  4.01
## # ... with 1,155 more rows

Linear SVM

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     5
## 
## - best performance: 0.294 
## 
## - Detailed performance results:
##    cost error dispersion
## 1 1e-03 0.315     0.0513
## 2 1e-02 0.315     0.0513
## 3 1e-01 0.300     0.0586
## 4 1e+00 0.296     0.0633
## 5 5e+00 0.294     0.0647
## 6 1e+01 0.294     0.0647
## 7 1e+02 0.294     0.0647
## 
## Call:
## best.tune(method = svm, train.x = vote96 ~ ., data = as_tibble(mh_split$train), 
##     ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)), 
##     kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  5 
##       gamma:  0.125 
## 
## Number of Support Vectors:  513
## 
##  ( 258 255 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

## Area under the curve: 0.777

Polynomial SVM

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0.284 
## 
## - Detailed performance results:
##    cost error dispersion
## 1 1e-03 0.315     0.0389
## 2 1e-02 0.315     0.0389
## 3 1e-01 0.304     0.0388
## 4 1e+00 0.284     0.0459
## 5 5e+00 0.300     0.0397
## 6 1e+01 0.301     0.0309
## 7 1e+02 0.311     0.0339
## 
## Call:
## best.tune(method = svm, train.x = vote96 ~ ., data = as_tibble(mh_split$train), 
##     ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)), 
##     kernel = "polynomial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##       gamma:  0.125 
##      coef.0:  0 
## 
## Number of Support Vectors:  506
## 
##  ( 263 243 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

## Area under the curve: 0.73

Radial SVM

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##     1
## 
## - best performance: 0.288 
## 
## - Detailed performance results:
##    cost error dispersion
## 1 1e-03 0.315     0.0535
## 2 1e-02 0.315     0.0535
## 3 1e-01 0.314     0.0509
## 4 1e+00 0.288     0.0524
## 5 5e+00 0.291     0.0561
## 6 1e+01 0.293     0.0568
## 7 1e+02 0.313     0.0537
## 
## Call:
## best.tune(method = svm, train.x = vote96 ~ ., data = as_tibble(mh_split$train), 
##     ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)), 
##     kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.125 
## 
## Number of Support Vectors:  510
## 
##  ( 266 244 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

## Area under the curve: 0.749

Compare SVMs

SVM kernel    CV training error rate
----------    ----------------------
Linear        0.294
Polynomial    0.284
Radial        0.288
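The tuning runs above use R's `e1071::tune`; a rough scikit-learn analogue cross-validates the same cost grid for each kernel. The data below are a synthetic stand-in, not the voter-turnout data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

costs = [0.001, 0.01, 0.1, 1, 5, 10, 100]   # same grid as the R calls
for kernel in ("linear", "poly", "rbf"):
    search = GridSearchCV(SVC(kernel=kernel), {"C": costs}, cv=10)
    search.fit(X, y)
    # report best cost and its 10-fold CV error rate
    print(kernel, search.best_params_["C"], round(1 - search.best_score_, 3))
```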

Logistic regression

## 
## Call:
## glm(formula = vote96 ~ ., family = binomial, data = as_tibble(mh_split$train))
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.508  -1.049   0.534   0.855   1.948  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.00019    0.60162   -6.65  3.0e-11 ***
## mhealth_sum -0.08478    0.02795   -3.03   0.0024 ** 
## age          0.04332    0.00585    7.41  1.3e-13 ***
## educ         0.20305    0.03439    5.90  3.5e-09 ***
## black1       0.11542    0.23625    0.49   0.6252    
## female1      0.20224    0.16607    1.22   0.2233    
## married1     0.21052    0.18544    1.14   0.2563    
## inc10        0.06051    0.03143    1.93   0.0542 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1016.74  on 815  degrees of freedom
## Residual deviance:  876.12  on 808  degrees of freedom
## AIC: 892.1
## 
## Number of Fisher Scoring iterations: 4

## Area under the curve: 0.778

The test error rate for the logistic regression model is 0.261.

Decision tree

## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 816 1000 1 ( 0 1 )  
##    2) age < 48.5 516  700 1 ( 0 1 )  
##      4) educ < 12.5 191  300 0 ( 1 0 )  
##        8) educ < 11.5 55   60 0 ( 1 0 ) *
##        9) educ > 11.5 136  200 1 ( 0 1 ) *
##      5) educ > 12.5 325  400 1 ( 0 1 )  
##       10) mhealth_sum < 4.5 259  300 1 ( 0 1 ) *
##       11) mhealth_sum > 4.5 66   90 0 ( 1 0 ) *
##    3) age > 48.5 300  300 1 ( 0 1 )  
##      6) educ < 12.5 153  200 1 ( 0 1 )  
##       12) inc10 < 1.08335 43   60 1 ( 0 1 ) *
##       13) inc10 > 1.08335 110  100 1 ( 0 1 ) *
##      7) educ > 12.5 147   60 1 ( 0 1 ) *

## Area under the curve: 0.576

The test error rate for the decision tree model is 0.315.

Bagging

## 
## Call:
##  randomForest(formula = vote96 ~ ., data = as_tibble(mh_split$train),      mtry = 7) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 32.4%
## Confusion matrix:
##    0   1 class.error
## 0 89 168       0.654
## 1 96 463       0.172

## Area under the curve: 0.714

Random forest

## 
## Call:
##  randomForest(formula = vote96 ~ ., data = as_tibble(mh_split$train)) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 30.4%
## Confusion matrix:
##    0   1 class.error
## 0 83 174       0.677
## 1 74 485       0.132

## Area under the curve: 0.743

Comparison

  • SVM (linear kernel)
  • Logistic regression
  • Decision tree
  • Bagging (\(B = 500\) trees)
  • Random forest (\(B = 500\) trees, \(m = \sqrt{p}\))
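A comparison like this can be sketched with cross-validated accuracy in scikit-learn (synthetic data standing in for the voter-turnout data; bagging is a random forest-style ensemble with all predictors considered at each split):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(size=300) > 1).astype(int)

models = {
    "SVM (linear kernel)": SVC(kernel="linear"),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(),
    "Bagging (B = 500)": BaggingClassifier(n_estimators=500),
    "Random forest (B = 500, m = sqrt(p))": RandomForestClassifier(n_estimators=500),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold CV accuracy
    print(f"{name}: {scores.mean():.3f}")
```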